30 research outputs found
Use of Weighted Finite State Transducers in Part of Speech Tagging
This paper addresses issues in part-of-speech disambiguation using
finite-state transducers and presents two main contributions to the field. The
first is the use of finite-state machines for part-of-speech tagging:
linguistic and statistical information is represented in terms of weights on
transitions in weighted finite-state transducers. The second is the
successful combination of techniques -- linguistic and statistical -- for word
disambiguation, compounded with the notion of word classes.
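The core idea of weighting transitions can be illustrated with a toy sketch. This is not the paper's actual transducer: the tags, weights, and the two tables below are invented for illustration, with weights read as negative log probabilities so that the best tag sequence is the minimum-weight path through the machine.

```python
import math

# Hypothetical toy model: transition and emission weights stand in for the
# arcs of a weighted finite-state transducer; lower total weight is better.
TRANS = {  # weight of moving from one tag to the next
    ("<s>", "DET"): 0.5, ("<s>", "NOUN"): 1.5,
    ("DET", "NOUN"): 0.3, ("DET", "DET"): 3.0,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 1.8,
    ("VERB", "DET"): 0.7, ("VERB", "NOUN"): 1.2,
}
EMIT = {  # weight of a tag emitting a word (lexicon arcs)
    ("the", "DET"): 0.1,
    ("dog", "NOUN"): 0.4,
    ("barks", "VERB"): 0.5, ("barks", "NOUN"): 1.9,
}

def tag(words):
    """Return the minimum-weight tag path for a sentence."""
    paths = {("<s>",): 0.0}  # partial tag path -> accumulated weight
    for w in words:
        nxt = {}
        for path, cost in paths.items():
            for (word, t), e in EMIT.items():
                if word != w:
                    continue
                trans = TRANS.get((path[-1], t))
                if trans is None:
                    continue
                cand = cost + trans + e
                key = path + (t,)
                if cand < nxt.get(key, math.inf):
                    nxt[key] = cand
        paths = nxt
    best = min(paths, key=paths.get)
    return list(best[1:])  # drop the start symbol
```

In a real weighted transducer the same effect is obtained by composing a lexicon transducer with a tag-sequence model and taking the shortest path.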
GIST-IT: Summarizing Email Using Linguistic Knowledge and Machine Learning
We present a system for the automatic extraction of salient information from email messages, thus providing the gist of their meaning. Dealing with email raises several challenges that we address in this paper, notably data that are heterogeneous in length and topic. Our method combines shallow linguistic processing with machine learning to extract phrasal units that are representative of email content. The GIST-IT application is fully implemented and embedded in an active mailbox platform. Evaluation was performed over three machine-learning paradigms.
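The extraction step can be sketched in miniature. This is not the GIST-IT pipeline itself: the function below is a hypothetical stand-in that treats stopword-free bigrams as candidate phrasal units (where the real system uses shallow linguistic chunking) and ranks them by raw frequency (where the real system uses trained models).

```python
import re
from collections import Counter

def extract_gist(text, stopwords, top_n=3):
    """Toy salience ranking: pull out candidate phrasal units and
    return the most frequent ones as the 'gist' of the message."""
    words = re.findall(r"[a-z']+", text.lower())
    # Bigrams containing no stopword stand in for shallow NP chunks.
    cands = [
        (a, b) for a, b in zip(words, words[1:])
        if a not in stopwords and b not in stopwords
    ]
    return [" ".join(c) for c, _ in Counter(cands).most_common(top_n)]
```

A frequency ranking like this is only a baseline; the abstract's point is precisely that learned models outperform such surface statistics.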
The setting is a small Mississippi River town in the 1830s, and the characters are the children and grown-ups of the town. Tom Sawyer is the main character, and you follow him around during the book. (Kuiper 1122) This book was largely based on Mark Twain's boyhood. The famous whitewashing scene actually happened: Mark was Tom, getting other little boys to do his work. He also got lost in that very same cave. Huck Finn was the same way; Huck was based on Twain's boyhood friend and "idol", Tom Blankenship. Tom Sawyer, the main character of the work, is hardly the "model boy". He is just like any other boy, mischievous and irresponsible, yet goodhearted. He reminds us all of how we used to be at that age. We did whatever we could to have fun. He is a thirteen-year-old boy filled with adventure and excitement.
Issues In Text-To-Speech For French
This paper reports the progress of the French text-to-speech system being developed at AT&T Bell Laboratories as part of a larger project for multilingual text-to-speech systems, including languages such as Spanish, Italian, German, Russian, and Chinese. These systems, based on diphone and triphone concatenation, follow the general framework of the Bell Laboratories English TTS system [?], [?]. This paper provides a description of the approach, the current status of the French text-to-speech project, and some problems particular to French.
Speech synthesis and natural language processing (La synthèse de la parole et le traitement automatique des langues)
Information Retrieval Based on Context Distance and Morphology
We present an approach to information retrieval based on context distance and morphology. Context distance is a measure we use to assess the closeness of word meanings. This context distance model measures semantic distances between words using the local contexts of words within a single document as well as the lexical co-occurrence information in the set of documents to be retrieved. We also propose to integrate the context distance model with morphological analysis in determining word similarity, so that the two can enhance each other. Using the standard vector-space model, we evaluated the proposed method on a subset of the TREC-4 corpus (AP88 and AP90 collections, 158,240 documents, 49 queries). Results show that this method improves the 11-point average precision by 8.6%.
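The local-context idea can be sketched as follows. This is a minimal illustration, not the paper's model: it builds a co-occurrence vector for each word from a fixed-size context window and takes one minus cosine similarity as a stand-in "context distance"; the window size and the cosine choice are assumptions.

```python
import math
from collections import defaultdict

def context_vectors(docs, window=2):
    """Build a co-occurrence vector for each word from its local contexts."""
    vecs = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        toks = doc.lower().split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vecs[w][toks[j]] += 1
    return vecs

def context_distance(vecs, a, b):
    """1 - cosine similarity of the two words' context vectors."""
    va, vb = vecs[a], vecs[b]
    dot = sum(va[k] * vb.get(k, 0) for k in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)
```

Words that occur in similar local contexts ("cat" and "dog" in parallel sentences) come out closer than words that do not, which is the property the retrieval model exploits.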
Using word class for part-of-speech disambiguation
This paper presents a methodology for improving part-of-speech disambiguation using word classes. We build on earlier work for tagging French, where we showed that statistical estimates can be computed without lexical probabilities. We investigate new directions for deriving different kinds of probabilities based on paradigms of tags for given words. We base estimates not on the words themselves, but on the set of tags associated with a word. We compute frequencies of unigrams, bigrams, and trigrams of word classes in order to further refine the disambiguation. This new approach gives a more efficient representation of the data for part-of-speech disambiguation. We show empirical results to support our claim. We demonstrate that, besides providing good estimates for disambiguation, word classes solve some of the problems caused by sparse training data. We describe a part-of-speech tagger built on these principles, and we suggest a methodology for developing an adequate training corpus.
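The shift from word n-grams to word-class n-grams can be sketched concretely. In this toy version (not the paper's implementation), a word's class is the set of tags it takes anywhere in a small tagged corpus, and trigram counts are gathered over those classes instead of over the words themselves.

```python
from collections import Counter, defaultdict

def class_trigrams(tagged_sents):
    """Count trigrams of word classes (the set of tags a word can take)
    rather than trigrams of words, from a toy tagged corpus."""
    # Build each word's ambiguity class from the corpus itself.
    classes = defaultdict(set)
    for sent in tagged_sents:
        for word, t in sent:
            classes[word].add(t)
    cls = {w: frozenset(ts) for w, ts in classes.items()}
    # Count trigrams over class sequences instead of word sequences.
    grams = Counter()
    for sent in tagged_sents:
        seq = [cls[w] for w, _ in sent]
        for i in range(len(seq) - 2):
            grams[tuple(seq[i:i + 3])] += 1
    return cls, grams
```

Because many words share the same ambiguity class, counts pool across words, which is how this representation eases the sparse-data problem the abstract mentions.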
The automatic induction of concatenative units from machine readable dictionaries and corpora for speech synthesis
The purpose of this research is to determine the best method for deciding on an optimal set of concatenative units for concatenative speech synthesis. Of the two main approaches to speech synthesis, segmental synthesis and rule-based synthesis, the former relies heavily on the successful choice of concatenative units. Segmental synthesis consists of concatenating segmental units (diphones, triphones, etc.); rule-based synthesis consists of the computation of control parameters based on pre-established rules. Deciding on the set of diphones is quite straightforward in the sense that it suffices to take the phoneme inventory of a language and simply combine each phoneme with every other one. For example, taking the approximately 35 French phonemes, 1,225 phonemic pairs (35 x 35) constitute the complete and exhaustive starting diphone inventory. On the other hand, deciding on the set of triphones, quadriphones, and larger units raises difficult questions about the nature of phonemes in a given language, such as: (1) stability vs. instability in a coarticulatory environment, (2) size of the overall inventory, and (3) frequency of that unit in the language, in combination with factors (1) and (2).
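The exhaustive diphone inventory described above is just the Cartesian product of the phoneme set with itself, which reproduces the 35 x 35 = 1,225 count. The sketch below uses placeholder symbols, not the actual French phoneme inventory.

```python
def diphone_inventory(phonemes):
    """All ordered phoneme pairs: the complete, exhaustive
    starting diphone inventory for a language."""
    return [(a, b) for a in phonemes for b in phonemes]

# Placeholder symbols standing in for the ~35 French phonemes.
french_like = [f"p{i}" for i in range(35)]
```

Unlike diphones, triphone and larger inventories cannot be enumerated this way in practice, which is why frequency data from dictionaries and corpora are needed to prune them.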
We report on experiments with four different databases, with comparisons between the resources regarding their n-gram frequency output. The first two databases consist of pronunciation-field information from two dictionaries, the Encyclopedic Robert French dictionary with 85,000 headwords and the smaller Collins Gem containing 15,000 words. For comparison, we use two text corpora, the Hansard (about 2.5 million words) and the smaller Tubach and Boe corpus (80,000 words); both corpora were processed by a set of grapheme-to-phoneme rules. A frequency extraction program was applied to all four resources to extract trigram phonemic frequencies; this serves as a basis for comparison between dictionary-derived and corpus-derived frequencies.
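The frequency-extraction pass can be sketched as a simple count over phonemic transcriptions. This is an illustrative stand-in for the program described above, not its actual code; the example transcriptions are invented.

```python
from collections import Counter

def trigram_frequencies(phonemic_words):
    """Count phoneme trigrams across a list of phonemic transcriptions
    (each transcription is a sequence of phoneme symbols)."""
    counts = Counter()
    for word in phonemic_words:
        for i in range(len(word) - 2):
            counts[tuple(word[i:i + 3])] += 1
    return counts
```

Running the same count over dictionary pronunciation fields and over phonemized corpus text yields the two frequency profiles whose comparison the study reports.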